Let \(\theta\) be an unknown parameter or a vector of unknown parameters. Assume that \(\{X_1, X_2, \cdots, X_n \}\) is a set of IID random variables with distribution \(F(x; \theta)\). When the distribution is discrete, the probability mass function (PMF) is used in the likelihood function. If the distribution is continuous, the probability density function (PDF) is used in the likelihood function.
We only focus on continuous distributions in this series of research tutorials. In the rest of this note, we assume \(\{X_1, X_2, \cdots, X_n \}\) is a set of IID continuous random variables with density function \(f(x; \theta)\). Discrete distributions such as the binomial and Poisson distributions are used occasionally to explain some concepts.
Recall that the likelihood of observing an IID random sample \(\{x_1, x_2, \cdots, x_n \}\) from the distribution with density function \(f(x;\theta)\) is given by
\[ L(\theta| \mathbf{x}) = \prod_{i = 1}^n f(x_i; \theta) \]
The MLE of \(\theta\), denoted by \(\hat{\theta}\), is the solution to the optimization problem
\[ \hat{\theta} = \arg \max_{\theta \in \Omega} L(\theta: \mathbf{x}). \]
Since the logarithm of the likelihood is easier to handle in mathematics, we define the above optimization problem using the log-likelihood, that is
\[ \hat{\theta} = \arg\max_{\theta \in \Omega} l(\theta: \mathbf{x}), \]
where \(l(\theta: \mathbf{x}) = \log [L(\theta: \mathbf{x})]\).
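As a quick numerical illustration (a sketch, not part of the note's derivations), the following Python snippet evaluates the log-likelihood of a simulated exponential sample over a grid and compares the grid maximizer with the closed-form MLE \(\hat{\theta} = 1/\bar{x}\); the simulated data, sample size, and grid are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated IID sample from the exponential density f(x; theta) = theta * exp(-theta * x)
theta_true = 2.0
x = rng.exponential(scale=1.0 / theta_true, size=500)

def log_likelihood(theta, x):
    """l(theta | x) = n * log(theta) - theta * sum(x) for the exponential model."""
    return len(x) * np.log(theta) - theta * x.sum()

# Closed-form MLE for this model: theta_hat = 1 / sample mean
theta_hat = 1.0 / x.mean()

# A grid search over theta confirms that theta_hat maximizes the log-likelihood
grid = np.linspace(0.1, 5.0, 2000)
grid_argmax = grid[np.argmax(log_likelihood(grid, x))]

print(round(theta_hat, 3), round(grid_argmax, 3))
```

The grid maximizer agrees with the analytic MLE up to the grid resolution; in practice one would use a numerical optimizer rather than a grid.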
The method of moments (MM) and maximum likelihood estimation (MLE) each provide a single estimated value from a random sample to approximate the unknown parameter. This type of estimation is called point estimation. This note will focus on inferences based on the MLE.
We briefly review some of the concepts introduced in an earlier note.
Parameter and Statistic: A parameter is a numerical characteristic of a population. A statistic is a numerical value calculated from a sample taken from the population.
Estimation, Estimator, and Estimate: An estimation is a method. An estimator is a formula or function defined based on a data set. An estimate is a value calculated using the formula (i.e., estimator) from the data set.
Point Estimate: a point estimate is a single value used to estimate the population parameter. It is a random number because the sample is random.
Interval estimate: an interval estimate is a range (interval) of values that contains the true value of the unknown parameter.
Remark - The point estimate is only a descriptive statistic based on the sample, not the population. For example, when considering the average starting salaries of recent statistics graduates at a university, we take a random sample of students and calculate the average salary, say $55,000. We cannot say the average starting salary of all statistics graduates at the university is $55,000. A better way to describe the starting salary of the population of statistics graduates is that the starting salary is around $55,000 - this is a range! That is, we need an interval estimate to generalize the information from the sample to the population - the confidence interval method!
Several measures are commonly used to assess the goodness of a point estimate. Before introducing these measures, please keep in mind that an estimate from a random sample is also random. This means that a point estimate has a distribution. We will discuss the probability distribution in a subsequent section. For now, we only assume a distribution associated with the point estimate so that we can define these measures rigorously using the definition of statistical expectation.
Assume that \(\hat{\theta}\) is a point estimate of \(\theta\) based on an estimation method. The measures of goodness of a point estimate are based on the estimation error, which is defined to be \(\text{error}(\hat{\theta}) = \hat{\theta}- \theta\).
Bias: the bias of point estimate \(\hat{\theta}\) is defined to be \(\text{bias} (\hat{\theta}) = E(\hat{\theta} - \theta) = E(\hat{\theta}) - \theta\).
The bias of a point estimate could be positive, negative, or zero. When the bias is zero, \(E(\hat{\theta}) = \theta\), and the corresponding point estimate is called an unbiased point estimate. Unbiasedness is a desirable feature of a point estimate.
Mean Square Error (MSE): The MSE of point estimate \(\hat{\theta}\) is defined to be \(\text{MSE} (\hat{\theta}) = E[(\hat{\theta}-\theta)^2]\).
MSE calculates the average of squared error. It is a legitimate measure of the goodness of a point estimate. The smaller the MSE, the better the point estimate.
Relationship among Bias, MSE, and Variance: \(\text{MSE}(\hat{\theta}) = \text{var}(\hat{\theta}) +[\text{bias} (\hat{\theta})]^2\)
The derivation of the above relationship is straightforward by noting that \((\hat{\theta} - \theta)^2 = \{ [\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta})-\theta]\}^2\). Expanding this binomial with some algebraic clean-up, you will see the relationship.
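The decomposition can also be checked by simulation. The sketch below (all numbers are illustrative choices) estimates the bias, variance, and MSE of the biased variance estimator \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2\) for normal data and verifies that MSE equals variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of MSE = var + bias^2 for the (biased) MLE of the
# normal variance, sigma_hat^2 = (1/n) * sum((x - xbar)^2).
sigma2_true = 4.0
n, reps = 20, 100_000
samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(reps, n))
est = samples.var(axis=1)           # ddof=0: the biased MLE of sigma^2

bias = est.mean() - sigma2_true     # theoretical bias: -sigma^2/n = -0.2
mse = np.mean((est - sigma2_true) ** 2)
decomposed = est.var() + bias ** 2  # var + bias^2 should match the MSE

print(round(bias, 3), round(mse, 3), round(decomposed, 3))
```

The estimated bias is close to the theoretical value \(-\sigma^2/n\), and the two routes to the MSE agree up to floating-point error.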
Absolute Estimation Error is defined to be \(\epsilon = |\hat{\theta} - \theta|\).
We cannot evaluate the above goodness measures unless the true value of the population parameter is available. This is a dilemma: we try to measure the goodness of the point estimate of an unknown parameter, but calculating these measures requires the true value of that unknown parameter.
In fact, we use these measures in different ways:
In simulation studies, we assume the true value of the population parameter, generate random samples, and evaluate the measures directly;
If prior information on the population parameter is available, we can approximate these measures based on a random sample.
We can use these measures to derive other inferential procedures such as confidence intervals.
One issue with a point estimate is that it carries no information about accuracy and precision. We may say the value of the population parameter is close to the sample statistic, but we don't know how close it is, or how close counts as close. This section introduces the general framework for deriving the confidence interval for a population parameter \(\theta\) based on its point estimate \(\hat{\theta}\), under the assumption that \(\hat{\theta}\) has a known density \(f(\hat{\theta})\).
We consider imposing a bound \(b\) on the absolute estimation error, \(\epsilon = |\hat{\theta} - \theta| < b\); we can then calculate the probability
\[ P(|\hat{\theta} - \theta| < b) = P(\theta - b <\hat{\theta} <\theta + b ) = \int_{\theta-b}^{\theta +b} f(\hat{\theta}) d\hat{\theta} = p_0 \]
The above \(p_0\) is the derived probability that the absolute estimation error is bounded by a given \(b\) with \(\theta\) being known. Under these assumptions, the inequality \(\theta - b <\hat{\theta} <\theta + b\) in the middle of the above equation tells how close the point estimate is to the parameter with probability \(p_0\). That is, with known \(\theta\), \(b\), and \(f_{\hat{\theta}}(\cdot)\), we can calculate the probability that \(\hat{\theta}\) falls into the interval \((\theta - b, \theta +b)\).
Next, we look at the above equation under different assumptions: assume we choose the value of \(p_0\), say \(p_0 = 0.95 = 95\%\), \(\hat{\theta}\) is calculated from a random sample, and \(b\) is a known error bound. Note that the above equation can be re-expressed in the following form
\[ P(|\hat{\theta} - \theta| < b) = P(\hat{\theta} - b <\theta <\hat{\theta} + b )= p_0. \]
The above equation says that the random interval \((\hat{\theta} - b, \hat{\theta} + b )\) has \(100p_0\%\) chance to include the true value of the parameter - This is the confidence interval with confidence level \(100p_0\% = 95\%\) with the choice of \(p_0 = 0.95\).
One remaining question is how to obtain \(b\) in the above discussion.
For the one-sample confidence interval of the population mean, using either the central limit theorem or the strong assumption of normality of the population, the sample mean as the point estimate of the population mean \(\mu\) has a (approximately) normal distribution. Then \(b\) is the margin of error (i.e., the absolute error bound of the point estimate), which has the following form
\[ b = Z_{1-\alpha/2} \frac{s}{\sqrt{n}} \]
where \(\alpha = 1 - \text{given confidence level}\), which is called the significance level in hypothesis testing, and \(Z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution, the (approximate) distribution of the standardized point estimate \(\hat{\mu} = \bar{X}\) in the one-sample confidence interval of the population mean.
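A minimal sketch of this margin-of-error calculation in Python, reusing the (hypothetical) salary example with assumed values of \(n\), \(\bar{x}\), and \(s\):

```python
from statistics import NormalDist
from math import sqrt

# Illustrative, hypothetical numbers: n starting salaries with sample mean
# xbar and sample standard deviation s, as in the salary example above.
n, xbar, s = 100, 55_000.0, 8_000.0
conf_level = 0.95
alpha = 1 - conf_level

z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}, approximately 1.96
b = z * s / sqrt(n)                        # margin of error
ci = (xbar - b, xbar + b)

print(round(b, 2), tuple(round(v, 2) for v in ci))
```

With these assumed numbers the margin of error is about $1,568, so the 95% confidence interval runs roughly from $53,432 to $56,568.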
Therefore, the critical information required in deriving the confidence interval of a population parameter is the distribution of the point estimate of the parameter. It is customarily called the sampling distribution of the point estimate.
Recall the central limit theorem concerning the distribution of sample means introduced in elementary statistics:
Central Limit Theorem
Assume that a random sample \(\{x_1, x_2, \cdots, x_n \}\) is taken from a population with mean \(\mu\) and standard deviation \(\sigma\). Define \(\bar{X} = \sum_{i =1}^n x_i / n\) to be the point estimate of the population mean \(\mu\). If the sample size is large, \(\bar{X}\) is approximately \(N(\mu, \sigma/\sqrt{n})\) (the second parameter denoting the standard deviation of \(\bar{X}\)).
Remark: The central limit theorem (CLT) does not assume a specific distribution of the population but unknown mean \(\mu\) and variance \(\sigma^2\). The only vague condition is that the sample size is large. Any results derived from the CLT are called large sample (asymptotic) results.
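A quick simulation sketch of the CLT: sample means drawn from a skewed exponential population (an illustrative choice of population) are approximately normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(2)

# CLT sketch: means of samples from a skewed exponential population are
# approximately N(mu, sigma/sqrt(n)) when n is large.
mu, n, reps = 1.0, 100, 20_000       # exponential with mean 1 and sd 1
means = rng.exponential(mu, size=(reps, n)).mean(axis=1)

# The empirical mean and sd of the sample means should be near 1.0 and
# sigma/sqrt(n) = 0.1, despite the skewness of the population.
print(round(means.mean(), 3), round(means.std(), 3))
```

A histogram of `means` would look bell-shaped even though the population itself is strongly right-skewed.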
One of the objectives of this note is to develop asymptotic results for the MLE that are similar to the above CLT so that we can make statistical inferences such as constructing confidence intervals and testing hypotheses. In the next few sections, we introduce the building blocks to be used in the MLE-based inferences.
Let \(\{x_1, x_2, \cdots, x_n \} \stackrel{\text{i.i.d}}{\sim} f_\theta(x)\), where \(\theta = (\theta_1, \theta_2, \cdots, \theta_k)\) is a vector of \(k\) parameters (a single parameter when \(k = 1\)). The likelihood of observing the data is given by
\[ L(\theta) = \prod_{i=1}^n f_\theta(x_i) \]
with corresponding log-likelihood function in the following
\[ l(\theta) = \sum_{i=1}^n \log f_\theta (x_i). \]
The system of score equations is given by
\[ \begin{cases} \frac{\partial l(\theta)}{\partial \theta_1} = 0 , \\ \frac{\partial l(\theta)}{\partial \theta_2} = 0, \\ \cdots \cdots \cdots \\ \frac{\partial l(\theta)}{\partial \theta_k} = 0. \end{cases} \] The above system can be written in the following form
\[ \frac{\partial l(\theta)}{\partial \theta} = \mathbf{0} \ \ \text{ or } \ \ \nabla_\theta l(\theta) = \mathbf{0} \]
The mathematical notation \(\nabla\) (read "nabla") is a differential operator. It is commonly used in calculus.
The gradient vector of \(l(\theta)\) is defined to be
\[ \nabla_\theta l(\theta) = \left( \frac{\partial l(\theta)}{\partial \theta_1}, \frac{\partial l(\theta)}{\partial \theta_2}, \cdots , \frac{\partial l(\theta)}{\partial \theta_k}\right) \]
To derive the asymptotic normality of the MLE, we need to use the following result from advanced real analysis: the order of differentiation and integration is exchangeable. The proof of general results requires more advanced mathematical tools in the abstract measure theory (to be covered in the doctoral-level real analysis course).
Lemma 1: Suppose that \(f(x,\theta)\) is differentiable in \(\theta\) and there exists a function \(g(x,\theta)\) such that
\(\big| \frac{\partial f(x, \vartheta)}{\partial \vartheta} \big| \le g(x, \theta)\) for all \(x\) and all \(\vartheta\) such that \(|\vartheta - \theta|\le \delta_0\);
\(\int_{-\infty}^\infty g(x, \theta) dx < \infty\) for all \(\theta\).
Then
\[\frac{d}{d\theta} \int_{-\infty}^\infty f(x,\theta) dx = \int_{-\infty}^\infty \frac{\partial f(x, \theta)}{\partial \theta} dx. \]
We will not prove the lemma in this note. For the likelihood function, the regularity conditions are satisfied. So we can use the lemma in statistical inference.
Fact. The expected value of the score function is equal to zero.
Proof: We will use the definition of expectation and the above lemma in the following proof. Without loss of generality, we consider the log-likelihood of observing a single data point \(f(x:\theta)\) and assume \(\theta\) to be a univariate parameter.
\[ E\left[ \frac{\partial \log f(x, \theta)}{\partial \theta} \right] \stackrel{\text{def}}{=} \int_{-\infty}^\infty \frac{\partial \log f(x, \theta)}{\partial \theta} f(x, \theta) dx \]
\[ = \int_{-\infty}^\infty \left[\frac{\partial f(x, \theta)/\partial \theta}{f(x,\theta)}\right] f(x, \theta) dx = \int_{-\infty}^\infty \frac{\partial f(x, \theta)}{\partial \theta} dx \]
\[ \stackrel{\text{switch}}{=} \frac{\partial}{\partial \theta} \int_{-\infty}^\infty f(x, \theta) dx = \frac{\partial}{\partial \theta} (1) = 0. \]
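The zero-mean property of the score can be checked by Monte Carlo. For the exponential density \(f(x,\theta) = \theta e^{-\theta x}\) (an illustrative model with an illustrative parameter value), the score of a single observation is \(1/\theta - x\), and its average over simulated draws should be near zero.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check that the expected score is zero for the exponential
# density f(x; theta) = theta * exp(-theta * x):
# score(x) = d/dtheta log f(x; theta) = 1/theta - x, and E[X] = 1/theta.
theta = 2.0
x = rng.exponential(1.0 / theta, size=1_000_000)
mean_score = np.mean(1.0 / theta - x)

print(round(mean_score, 4))
```

The average score is within Monte Carlo error of zero, matching the fact proved above.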
Remarks Lemma 1 introduced in this subsection is related to several big theorems in mathematics (advanced calculus and real analysis).
The two conditions in Lemma 1 are also called regularity conditions. Many statistical theorems (or procedures) require some regularity conditions needed in mathematical proofs and derivations. Make sure that the regularity conditions are satisfied in practical applications.
If the integral is definite with scalar integral limits, then above Lemma 1 is a special case of Leibniz Rule in Calculus.
More general forms of the above Lemma 1 are introduced in real analysis textbooks under various convergence theorems, such as the Lebesgue dominated convergence theorem (with its special case, the bounded convergence theorem) and the monotone convergence theorem.
Since the derivative is the limit of the rate of change (instantaneous rate), there is also a rule of exchange in the order of integral and limit.
Since the integral can also be viewed as an infinite sum, there is likewise a rule for exchanging the order of summation and derivative (or summation and limit).
These types of exchange-of-order operations are frequently used in deriving asymptotic results in statistics.
To conclude this subsection, we use the same idea in Lemma 1 to derive the following Lemma concerning the second-order derivative of the log-likelihood function. Lemma 2 links the two definitions of the Fisher Information to be introduced in the next subsection.
Lemma 2: Under some similar regularity conditions (as stated in Lemma 1), we have
\[ E\left[ \frac{\frac{\partial^2}{\partial \theta^2}f(x,\theta)}{f(x,\theta)}\right] = 0. \]
Proof: Using the definition of expectation and the exchange of the order of the derivative and the integral, we have
\[ E\left[ \frac{\frac{\partial^2}{\partial \theta^2}f(x,\theta)}{f(x,\theta)}\right] = \int \left[ \frac{\frac{\partial^2}{\partial \theta^2}f(x,\theta)}{f(x,\theta)}\right] f(x,\theta)dx \]
\[ = \int \frac{\partial^2}{\partial \theta^2}f(x,\theta) dx \stackrel{\text{switch}}{=} \frac{\partial^2}{\partial \theta^2} \int f(x,\theta) dx = \frac{\partial^2}{\partial \theta^2} (1) = 0. \]
The information on the variance and covariance of the MLE is contained in the Fisher information matrix. It must be explicitly specified when making inferences about the MLE of model parameters. To better understand the concept, we start with the case with single-parameter models.
Assume that \(\theta\) is a univariate parameter of the population with density \(f(x,\theta)\). Let \(\{x_1, x_2, \cdots, x_n \} \sim f(x, \theta)\) be an IID sample. We use vector \(\mathbf{x}\) to denote the set of random samples. The log-likelihood of observing the data is a function of \(\theta\) that has the following form
\[ l(\theta:\mathbf{x}) = \log L(\theta: \mathbf{x}) = \sum_{i=1}^n \log f(x_i, \theta). \]
The Fisher Information Number based on a random sample with size \(n\) is defined to be of the following form
\[ I_n(\theta) \stackrel{\text{def}}{=} E_{\mathbf{x}}\left[ \left(\frac{\partial}{\partial \theta} \log L(\theta: \mathbf{x}) \right)^2\right]. \]
The Fact proved in the previous subsection (using Lemma 1) says that
\[ E_X\left[ \frac{\partial}{\partial \theta} \log f(X_i, \theta)\right] = 0 \ \ \text{for} \ \ i = 1, 2, \cdots, n. \]
Therefore,
\[ \left\{E_{\mathbf{x}}\left[ \frac{\partial}{\partial \theta} \sum_{i=1}^n \log f(x_i, \theta)\right] \right\}^2 =\left\{ E_{\mathbf{x}}\left[ \frac{\partial }{\partial \theta} \log L(\theta: \mathbf{x})\right] \right\}^2 = 0 \]
Using the formula \(\text{Var}(X) = E(X^2) - [E(X)]^2\) for all random variable \(X\), we have the following important result.
Property: Let \(L(\theta)\) be the likelihood function defined based on the IID sample data \(\{x_1, x_2, \cdots, x_n \} \sim f_\theta(x)\). Then
\[ I_n(\theta) \stackrel{\text{def}}{=} E_{\mathbf{x}}\left[ \left(\frac{\partial}{\partial \theta} \log L( \theta: \mathbf{x}) \right)^2\right] - \left\{E_{\mathbf{x}}\left[ \frac{\partial}{\partial \theta} \log L(\theta: \mathbf{x})\right] \right\}^2 = \text{Var}_{\mathbf{x}}\left[\frac{\partial}{\partial \theta}\log L(\theta: \mathbf{x})\right]. \]
This means that the Fisher Information \(I_n(\theta)\) is the variance of the score function.
Next, we present an alternative definition of the Fisher Information. Note that
\[ \frac{\partial^2}{\partial \theta^2}\left[ \log L(\theta: \mathbf{x}) \right] = \frac{\partial}{\partial \theta} \left\{ \frac{\partial }{\partial \theta} \left[ \log L( \theta: \mathbf{x})\right]\right\} =\frac{\partial}{\partial \theta} \left\{ \left[ \frac{\partial L(\theta: \mathbf{x})/\partial \theta}{L(\theta: \mathbf{x})}\right]\right\} \] Using the multiplicative rule of derivative, we have
\[ = \frac{L(\theta:\mathbf{x})\frac{\partial^2}{\partial\theta^2}L(\theta:\mathbf{x}) - \left[\frac{\partial}{\partial\theta}L(\theta:\mathbf{x})\right]^2}{L^2(\theta: \mathbf{x})} \]
\[ =\frac{\frac{\partial^2}{\partial\theta^2}L(\theta:\mathbf{x}) }{L(\theta: \mathbf{x})} - \left[\frac{\frac{\partial}{\partial\theta}L(\theta:\mathbf{x}) }{L(\theta: \mathbf{x})}\right]^2. \]
The above is a pointwise identity in the data. Taking the expectation of both sides and applying the argument of Lemma 2 to the joint density \(L(\theta: \mathbf{x})\) (so the expectation of the first term is zero), we obtain
\[ E\left\{\frac{\partial^2}{\partial \theta^2}\left[ \log L(\theta: \mathbf{x}) \right]\right\} = - E\left\{\left[\frac{\frac{\partial}{\partial\theta}L(\theta:\mathbf{x}) }{L(\theta: \mathbf{x})}\right]^2\right\}. \]
Since the bracketed ratio is exactly the score \(\frac{\partial}{\partial\theta}\log L(\theta:\mathbf{x})\), the Fisher information number can also be derived as
\[ I_n(\theta) = - E \left\{\frac{\partial^2}{\partial \theta^2}\left[ \log L(\theta: \mathbf{x}) \right]\right\}. \]
This means that the Fisher Information number is the negative expectation of the second-order derivative of the log-likelihood.
Since we assume an IID sample in most cases, we can derive the relationship between the Fisher information number associated with the entire observed data set and that based on the individual observation in the data set. Denote \(I_0(\theta)\) as the Fisher information number of a single observation of an IID sample.
\[ I_n(\theta) = - E \left\{\frac{\partial^2}{\partial \theta^2}\left[ \log L(\theta: \mathbf{x}) \right]\right\} = -E\left\{\frac{\partial^2}{\partial \theta^2}\left[\sum_{i=1}^n \log f(x_i, \theta) \right]\right\} \]
\[ = \sum_{i=1}^n \left\{- E\left[\frac{\partial^2}{\partial \theta^2} \log f(x_i, \theta) \right] \right\} = nI_0(\theta). \]
Remarks Some comments on the Fisher Information:
The Fisher information is a function of the parameter only, since it is defined as an expectation (with respect to \(\mathbf{X}\)) of derivatives of the log-likelihood.
If we replace the parameter with an estimated one such as MLE, we obtain \(\widehat{I_n(\theta)} = I_n(\hat{\theta})\). \(\widehat{I_n(\theta)}\) is called observed Fisher information number!
The Fisher information number is nonnegative (it is a variance) and is positive under the usual regularity conditions.
The Fisher information number is dependent on the sample size. This will be used in developing the asymptotic normality for the MLE in the next section.
Example: Consider an IID sample \(\{x_1, x_2, \cdots, x_n\} \sim f(x,\theta) = \theta e^{-\theta x}\), \(x > 0\). Find the Fisher information number.
Note that the log-likelihood function of \(\theta\) is given by
\[ l(\theta: \mathbf{x}) = n \log \theta -\theta \sum_{i=1}^n x_i. \]
The second order derivative of \(l(\theta)\) with respect to \(\theta\) is
\[ \frac{\partial^2}{\partial \theta^2} l(\theta) = \frac{\partial}{\partial \theta}\left( \frac{n}{\theta} - \sum_{i=1}^n x_i \right) = -\frac{n}{\theta^2}. \]
By definition,
\[ I_n(\theta) = -E\left( \frac{\partial^2}{\partial \theta^2} l(\theta, \mathbf{x})\right) = -E(-\frac{n}{\theta^2}) = \frac{n}{\theta^2}. \]
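The two definitions of the information can be compared numerically for this example. The sketch below computes the exact value \(n/\theta^2\) and a Monte Carlo estimate of the variance of the score (the parameter value and sample size are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)

# Two routes to I_n(theta) for the exponential model, both giving n/theta^2:
# (1) -E[d^2 l / dtheta^2] = n/theta^2 (here the second derivative is nonrandom);
# (2) the variance of the score sum_i (1/theta - x_i), estimated by Monte Carlo.
theta, n, reps = 2.0, 20, 100_000
info_exact = n / theta**2                      # = 5.0

x = rng.exponential(1.0 / theta, size=(reps, n))
scores = (1.0 / theta - x).sum(axis=1)         # score of each simulated sample
info_mc = scores.var()

print(info_exact, round(info_mc, 2))
```

The Monte Carlo variance of the score matches \(n/\theta^2\) within simulation error, illustrating the equivalence of the two definitions for this model.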
The exponential density function has another form (i.e., a reparametrization) \(f(x,\beta) = \frac{1}{\beta} e^{-x/\beta}\). A natural question is whether we need to repeat the calculation in the above example to find the Fisher information \(I_n(\beta)\). The answer is no: we can carry out the reparametrization directly in the Fisher information number.
Let \(\eta = \psi(\theta)\), where \(\psi(\cdot)\) is invertible, that is, \(\theta = \psi^{-1}(\eta)\) exists. Assume that the two density forms of \(X\) are \(f(x, \theta)\) and \(g(x, \eta)\). Clearly, \(g(x, \eta) = f(x, \psi^{-1}(\eta))\) (why?). Assume further that \(I_n(\theta) = I_n[\psi^{-1}(\eta)]\) is known.
Due to reparametrization, the same random variable has different forms of density function. When we work on the expectation with the random variable, we should specify the associated density function. Next, we express \(I_g(\eta)\) with respect to \(I_f(\theta)\).
\[ I_{n,g}(\eta) = E_g\left[ \left(\frac{\partial}{\partial \eta} \log g(x, \eta) \right)^2 \right] = E_g\left[ \left(\frac{\partial}{\partial \eta} \log f(x, \psi^{-1}(\eta)) \right)^2 \right] \]
By the chain rule, writing \(\theta = \psi^{-1}(\eta)\),
\[ = E_g\left[ \left(\frac{\partial}{\partial \theta} \log f(x, \theta) \times \frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2 \right] = E_g\left[ \left(\frac{\partial}{\partial \theta} \log f(x, \theta) \right)^2 \times \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2 \right] \]
\[ = E_g\left[ \left(\frac{\partial}{\partial \theta} \log f(x, \theta) \right)^2 \right]\times \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2 = I_{n,f}(\theta)\times \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2, \]
since \(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\) does not depend on \(x\) and factors out of the expectation.
That is,
\[ I_{n,g}(\eta)=I_{n,f}(\theta)\times \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2. \]
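A quick arithmetic check of this rule for the exponential example: with \(\eta = \beta\) and \(\theta = \psi^{-1}(\beta) = 1/\beta\), the rule reproduces the known information \(n/\beta^2\) for the mean parametrization (the values of \(n\) and \(\beta\) below are illustrative).

```python
# Reparametrization check for the exponential model: with eta = beta = 1/theta,
# psi_inverse(beta) = 1/beta, so d psi_inverse / d beta = -1/beta^2.
# The rule I_g(beta) = I_f(theta) * (d psi_inverse/d beta)^2 should give
# n/beta^2, the Fisher information for f(x, beta) = (1/beta) exp(-x/beta).
n, beta = 50, 0.5
theta = 1.0 / beta                 # psi_inverse(beta)
info_f = n / theta**2              # I_{n,f}(theta) from the earlier example
deriv = -1.0 / beta**2             # d psi_inverse / d beta
info_g = info_f * deriv**2

print(info_g, n / beta**2)         # both equal 200.0
```

No fresh likelihood calculation was needed: the transformation rule converts \(I_{n,f}(\theta) = n/\theta^2\) into \(I_{n,g}(\beta) = n/\beta^2\).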
For multi-parameter models (distributions), we need a Fisher information matrix to characterize the covariance structure of the MLE of the vector of parameters. Let \(f(x; \mathbf{\theta})\) denote the density function of \(X\) with a \(k\)-dimensional parameter vector \(\mathbf{\theta} = (\theta_1, \theta_2, \cdots, \theta_k)\). (Caution: if not specified, all vectors in this series of notes are column vectors. This is also the convention in most mathematics and statistics books and literature.)
Under the same setup, the likelihood function of observing \(\{x_1, x_2, \cdots, x_n \}\sim f(x, \mathbf{\theta})\) is given by
\[ L(\mathbf{\theta}:\mathbf{x}) = \prod_{i=1}^n f(x_i, \mathbf{\theta}) \]
The gradient vector of the log-likelihood is
\[ \frac{\partial}{\partial \theta} \log L(\theta, \mathbf{x}) = \left( \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x}), \frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}), \cdots, \frac{\partial}{\partial \theta_k} \log L(\theta, \mathbf{x})\right). \]
Denote
\[ \frac{\partial}{\partial \theta^T} \log L(\theta, \mathbf{x}) = \left( \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x}), \frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}), \cdots, \frac{\partial}{\partial \theta_k} \log L(\theta, \mathbf{x})\right)^T \]
The Fisher information matrix of the k-dimensional parameter \(\theta\) is defined to be
\[ \mathbb{I}_n(\mathbf{\theta}) = E\left( \frac{\partial}{\partial \theta} \log L(\theta, \mathbf{x})\frac{\partial}{\partial \theta^T} \log L(\theta, \mathbf{x}) \right). \]
The explicit matrix form of the two-parameter case is
\[ \mathbb{I}_n(\mathbf{\theta}) = E \left[\left( \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x}), \frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}) \right) \left( \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x}), \frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}) \right)^T\right] \]
\[ = E \begin{bmatrix} \left(\frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x}) \right)^2 & \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x})\frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}) \\ \frac{\partial}{\partial \theta_1} \log L(\theta, \mathbf{x})\frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}) & \left(\frac{\partial}{\partial \theta_2} \log L(\theta, \mathbf{x}) \right)^2 \end{bmatrix}. \]
Remarks
The Fisher information matrix is a square matrix. Its dimension is dependent on the dimension of the vector of parameters. For the case of a k-dimensional parameter vector, the dimension of the corresponding Fisher information matrix is \(k\times k\).
The individual cell element in the Fisher information matrix is expressed in the following
\[ [\mathbb{I}_n(\mathbf{\theta})]_{i,j} = E\left[\frac{\partial}{\partial \theta_i} \log L(\theta, \mathbf{x})\frac{\partial}{\partial \theta_j} \log L(\theta, \mathbf{x})\right] \]
The Fisher information matrix is the covariance matrix of the gradient vector \(\partial \log L(\theta: \mathbf{x})/\partial \theta\).
The Fisher information matrix based on a single observed data point \(x_i\) is still a \(k\times k\) square matrix; for the two-parameter case it is given by
\[ I_0(\theta)= E \begin{bmatrix} \left(\frac{\partial}{\partial \theta_1} \log f(x_i, \theta) \right)^2 & \frac{\partial}{\partial \theta_1} \log f(x_i, \theta)\frac{\partial}{\partial \theta_2} \log f(x_i, \theta) \\ \frac{\partial}{\partial \theta_1} \log f(x_i, \theta)\frac{\partial}{\partial \theta_2} \log f(x_i, \theta) & \left(\frac{\partial}{\partial \theta_2} \log f(x_i, \theta) \right)^2 \end{bmatrix}. \]
\[ I_n(\theta) = nI_0(\theta). \]
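As a sketch, the per-observation Fisher information matrix can be estimated by averaging the outer product of the score over simulated data. The normal model with parameters \((\mu, \sigma)\) and the parameter values below are illustrative choices; the known result for that model is \(I_0 = \mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo estimate of the per-observation Fisher information matrix for
# the normal model with parameters (mu, sigma). The known result is
# I_0 = [[1/sigma^2, 0], [0, 2/sigma^2]] = [[0.25, 0], [0, 0.5]] here.
mu, sigma, reps = 0.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=reps)

score_mu = (x - mu) / sigma**2                        # d/dmu log f
score_sigma = -1.0 / sigma + (x - mu)**2 / sigma**3   # d/dsigma log f
scores = np.stack([score_mu, score_sigma])            # shape (2, reps)

info_mc = scores @ scores.T / reps                    # E[score score^T]
print(np.round(info_mc, 3))
```

The off-diagonal entries are near zero, reflecting that the \(\mu\) and \(\sigma\) scores are uncorrelated in the normal model.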
Analogous to the case of single parameter distributions, we also have the following alternative definition of the Fisher information matrix for multi-parameter distributions
\[ \mathbb{I}_n(\mathbf{\theta}) = -E\left( \frac{\partial^2}{\partial \theta \partial \theta^T} \log L(\theta:\mathbf{x})\right), \]
where
\[ \mathbb{H}_n(\theta, \mathbf{x})=\frac{\partial^2}{\partial \theta \partial \theta^T} \log L(\theta:\mathbf{x}) = \begin{bmatrix} \frac{\partial^2}{\partial \theta_1^2} \log L(\theta, \mathbf{x}) & \frac{\partial^2}{\partial \theta_1 \partial \theta_2} \log L(\theta, \mathbf{x}) & \cdots & \frac{\partial^2}{\partial \theta_1 \partial \theta_k} \log L(\theta, \mathbf{x}) \\ \frac{\partial^2}{\partial \theta_2 \partial \theta_1} \log L(\theta, \mathbf{x}) & \frac{\partial^2}{\partial \theta_2^2} \log L(\theta, \mathbf{x}) & \cdots & \frac{\partial^2}{\partial \theta_2 \partial \theta_k} \log L(\theta, \mathbf{x}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial \theta_k \partial \theta_1} \log L(\theta, \mathbf{x}) & \frac{\partial^2}{\partial \theta_k \partial \theta_2} \log L(\theta, \mathbf{x}) & \cdots & \frac{\partial^2}{\partial \theta_k^2 } \log L(\theta, \mathbf{x}) \end{bmatrix}_{k\times k} \]
is the well-known Hessian matrix. Hessian Matrices are widely used in optimization problems.
Caution: Hessian Matrix and Jacobian Matrix are sometimes confusing.
The Hessian matrix is the square matrix of second-order partial derivatives of the objective function to be optimized.
The Jacobian matrix is the first-order derivatives of a vector of functions such as \((f_1(\theta), f_2(\theta), \cdots, f_k(\theta))\) where \(\theta = (\theta_1, \theta_2, \cdots, \theta_k)\). The Jacobian matrix associated with the vector of functions is defined by
\[ J(\theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_1} f_1(\theta) & \frac{\partial}{\partial \theta_2} f_1(\theta) & \cdots & \frac{\partial}{\partial \theta_k} f_1(\theta) \\ \frac{\partial}{\partial \theta_1} f_2(\theta) & \frac{\partial}{\partial \theta_2} f_2(\theta) & \cdots & \frac{\partial}{\partial \theta_k} f_2(\theta) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial \theta_1} f_k(\theta) & \frac{\partial}{\partial \theta_2} f_k(\theta) & \cdots & \frac{\partial}{\partial \theta_k} f_k(\theta) \end{bmatrix}_{k\times k} \]
If we define the components of the gradient of \(l(\theta)\) as the vector of functions
\[ \left(\frac{\partial l(\theta)}{\partial \theta_1}, \frac{\partial l(\theta)}{\partial \theta_2}, \cdots, \frac{\partial l(\theta)}{\partial \theta_k}\right) \stackrel{\text{def}}{\equiv} [f_1(\theta), f_2(\theta), \cdots, f_k(\theta)] \]
\(\hspace{5mm}\) Then the Hessian Matrix associated with the objective function \(l(\theta)\) and the Jacobian Matrix associated with gradient (score) functions are identical.
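This identity can be illustrated numerically: a finite-difference Jacobian of the gradient reproduces the Hessian. The quadratic objective below is an illustrative choice, since its gradient and Hessian are known in closed form.

```python
import numpy as np

# Sketch: the Hessian of an objective l(theta) equals the Jacobian of its
# gradient. Illustrated with l(theta) = 0.5 * theta^T A theta (A symmetric),
# whose gradient is A @ theta and whose Hessian is A itself.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad = lambda theta: A @ theta

def numeric_jacobian(f, theta, h=1e-6):
    """Central finite-difference Jacobian of the vector function f at theta."""
    k = len(theta)
    J = np.empty((k, k))
    for j in range(k):
        step = np.zeros(k)
        step[j] = h
        J[:, j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return J

theta0 = np.array([1.0, -2.0])
H = numeric_jacobian(grad, theta0)   # should reproduce A
print(np.round(H, 4))
```

The same finite-difference trick is a handy sanity check for hand-derived Hessians in likelihood calculations.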
We first introduce the concept of covariance and properties associated with multiple random variables.
Let \(X\) and \(Y\) be two random variables; the covariance, which measures the linear relationship between the two random variables, is defined by
\[ \text{cov}(X,Y) = E\{[X-E(X)][Y-E(Y)]\}. \]
Some properties of covariance:
\(\text{cov}(X,Y) = \text{cov}(Y,X)\).
When \(X = Y\), \(\text{cov}(X,Y) = \text{var}(X)\).
Let \(a\) and \(b\) be scalars and let \(Z = aX + bY\). Then \(E(Z) = E[aX + bY] = aE[X] + bE[Y]\) and \(\text{var}(Z) = a^2\text{var}(X) + 2ab\cdot \text{cov}(X,Y) + b^2 \text{var}(Y)\). (Prove this using the definition of the variance.)
The (linear) correlation coefficient between \(X\) and \(Y\) is given by \[ \rho = \frac{\text{cov}(X, Y)}{\sqrt{\text{var}(X)\text{var}(Y)}} \]
For a given sample \(\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n) \}\), the sample correlation coefficient is given by
\[ r = \frac{\left[\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})\right]/n}{\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n}\frac{\sum_{i=1}^n (y_i-\bar{y})^2}{n}}} = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2\sum_{i=1}^n (y_i-\bar{y})^2}}. \]
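A small sketch computing \(r\) from this formula and cross-checking it against numpy's built-in `np.corrcoef` (the simulated data are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample correlation coefficient r computed from the formula above,
# cross-checked against numpy's built-in np.corrcoef.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)

xc, yc = x - x.mean(), y - y.mean()
r = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```

Both routes give the same value, since the \(1/n\) factors cancel exactly as shown in the formula above.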
The variance of the linear combination \(Z = aX + bY\) involves the covariance of \(X\) and \(Y\). In general, for a linear combination of a sequence of random variables \(\{X_1, X_2, \cdots, X_p \}\), say \(W = \sum_{i=1}^p a_iX_i\), the variance of \(W\) is given by
\[ \text{var}(W) = \sum_{i=1}^p a_i^2 \text{var}(X_i) + \sum_{i\ne j} a_ia_j \text{cov}(X_i, X_j) \]
The above variance of \(W\) is obtained from the definition of variance and the binomial expansion. However, we can also write the linear combination of the random variables in vector form
\[ W = (a_1, a_2, \cdots, a_p) \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}. \]
Therefore, \(W\) is the dot product of a constant vector (the coefficients of the linear combination) and a random vector. This motivates us to study random vectors.
Random Vectors and Properties
We have discussed the distribution of single random variables and the related characterization. If we have a vector of \(p\) random variables
\[ \mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}. \]
Each random variable has support (i.e., domain) \(\mathbb{R}_i \subseteq \mathbb{R}\) for \(i = 1, 2, \cdots, p\). Let
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} \]
be the realization (i.e., the observed data values) of the random vector \(\mathbf{X}\). This means
\[ \mathbf{x} \in \mathbb{R}_1 \times \mathbb{R}_2 \times \cdots \times \mathbb{R}_p. \]
The mean of the random vector is
\[ E(\mathbf{X}) \stackrel{\text{def}}{\equiv} \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} . \]
The variance-covariance matrix is defined by
\[ V[\mathbf{X}] \stackrel{\text{def}}{\equiv} \begin{bmatrix} \text{var} (X_1) & \text{cov}(X_1, X_2) & \cdots & \text{cov}(X_1, X_p) \\ \text{cov} (X_2,X_1) & \text{var}(X_2) & \cdots & \text{cov}(X_2, X_p) \\ \vdots & \vdots & \vdots & \vdots\\ \text{cov} (X_p,X_1) & \text{cov}(X_p,X_2) & \cdots & \text{var}(X_p) \end{bmatrix}_{p\times p} \]
Some Properties of Random Vectors
A constant vector \(\mathbf{a}\) (vector of constants) satisfies \(E[\mathbf{a}] = \mathbf{a}\).
For random vectors \(\mathbf{X}\) and \(\mathbf{Y}\), \(E[\mathbf{X} + \mathbf{Y}] = E[\mathbf{X}] + E[\mathbf{Y}]\)
Let \(\mathbf{a}\) be a constant (column) vector and \(\mathbf{X}\) be a (column) random vector such that \(\mathbf{a}^T\mathbf{X}\) is well defined. Then \(\text{var}(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T\text{cov}(\mathbf{X})\mathbf{a}\).
Example: Derive the variance of \(Z = aX + bY\) using the properties of random vectors. Let \(\mathbf{d} = (a,b)^T\) and \(\mathbf{W} = (X, Y)^T\). Then
\[ \text{var}(Z) = \text{var}(\mathbf{d}^T\mathbf{W}) = \mathbf{d}^T\text{cov}(\mathbf{W})\mathbf{d} \]
\[ = [a,b]\begin{bmatrix} \text{var}(X) & \text{cov}(X,Y) \\ \text{cov}(Y, X) & \text{var}(Y) \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} \]
Expanding the above matrix product gives the quadratic form
\[ \text{var}(Z) = a^2 \text{var}(X) + 2ab\times \text{cov}(X,Y) + b^2 \text{var}(Y). \]
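The quadratic-form identity above also holds exactly for sample variances and covariances. As a quick numerical check (sketched here in Python with NumPy rather than the R used elsewhere in this note, purely for illustration; the small data vectors are made up), we can compare \(\mathbf{d}^T S \mathbf{d}\) with the directly computed sample variance of \(aX + bY\):

```python
import numpy as np

# Small fixed sample to illustrate var(aX + bY) = a^2 var(X) + 2ab cov(X,Y) + b^2 var(Y).
x = np.array([1.2, -0.7, 0.3, 2.1, -1.5, 0.8])
y = np.array([0.9, -0.2, 1.1, 1.8, -0.6, 0.4])
a, b = 2.0, -3.0

d = np.array([a, b])
S = np.cov(x, y)                       # 2x2 sample variance-covariance matrix
quad_form = d @ S @ d                  # d^T S d

direct = np.var(a * x + b * y, ddof=1) # sample variance of the linear combination
print(quad_form, direct)
```

The two numbers agree to floating-point precision because `np.cov` and `np.var(..., ddof=1)` use the same (unbiased) normalization.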
Recall that the density function of a univariate normal random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\) is given by
\[ f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \ \ \text{ for } \ -\infty < x < \infty. \]
Example: Suppose \(X\) is the height (in inches) and \(Y\) is the weight (in pounds) of a male student in a large university. Suppose further that \(X\) and \(Y\) each follow a normal distribution with parameters \(\mu_X=69\), \(\mu_Y=155\), \(\sigma_X=2.5\), and \(\sigma_Y=20\), and that the correlation coefficient between \(X\) and \(Y\) is \(\rho=0.55\). There may be different joint distributions that satisfy the given conditions. However, the given conditions uniquely determine the bivariate normal distribution, which has the following joint density function
\[ f(x,y)=\frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x} \right)^2 + \left(\frac{y-\mu_y}{\sigma_y} \right)^2 -2\rho\left(\frac{x-\mu_x}{\sigma_x} \right)\left(\frac{y-\mu_y}{\sigma_y} \right) \right] \right\}. \]
Obviously, the density for the bivariate normal is unwieldy, and it only gets worse when we consider higher-dimensional joint densities of more normal random variables. We can write the density in a more compact form using matrix notation.
Denote
\[ \mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \ \ \ \ \mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \ \ \ \ \mathbf{\Sigma} = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix} \ \ \ \ \]
The matrix form of the bivariate normal distribution is given by
\[ f(\mathbf{x}) = \frac{1}{2\pi} (\text{det}\mathbf{\Sigma})^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{x}-\mu)^T \mathbf{\Sigma}^{-1}(\mathbf{x}-\mu)\right]. \]
\(\Sigma\) is the covariance matrix of \(\mathbf{x}\). The above expression generalizes to the multivariate normal distribution, which is uniquely determined once the mean vector and covariance matrix are specified.
With the above example, we can specify the mean vector and covariance matrix required in the bivariate normal distribution.
\[ \mu = \begin{bmatrix} 69 \\ 155 \end{bmatrix} \ \ \text{ and } \ \ \mathbf{\Sigma} = \begin{bmatrix} 2.5^2 & 0.55\times 2.5 \times 20\\ 0.55\times 2.5 \times 20 & 20^2 \end{bmatrix} = \begin{bmatrix} 6.25 & 27.5\\ 27.5 & 400 \end{bmatrix} \]
\[ \text{det}\mathbf{\Sigma} = 6.25\times 400 - 27.5^2 = 1743.75. \] and
\[ \mathbf{\Sigma}^{-1} = \frac{1}{6.25\times 400 - 27.5^2}\begin{bmatrix} 400 & -27.5 \\ -27.5 & 6.25 \end{bmatrix} = \begin{bmatrix} 0.2293907 & -0.01577061 \\ -0.01577061 & 0.003584229 \end{bmatrix}. \]
\[ f(\mathbf{x}) = 0.0038\exp \left[-\frac{1}{2}(x-69, y-155) \begin{bmatrix} 0.2294 & -0.0158 \\ -0.0158 & 0.0036 \end{bmatrix} \begin{bmatrix} x-69 \\ y-155 \end{bmatrix} \right]. \]
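The numerical pieces of the density above are easy to verify. The following sketch (in Python with NumPy rather than the note's R, using the same \(\mu\), \(\sigma\), and \(\rho\) values as the example) recomputes \(\det\mathbf{\Sigma}\), \(\mathbf{\Sigma}^{-1}\), and the leading constant \(1/(2\pi\sqrt{\det\mathbf{\Sigma}})\):

```python
import numpy as np

# Height-weight example: recompute det(Sigma), Sigma^{-1}, and the leading
# constant of the bivariate normal density.
Sigma = np.array([[2.5**2, 0.55 * 2.5 * 20],
                  [0.55 * 2.5 * 20, 20**2]])

det = np.linalg.det(Sigma)            # 6.25*400 - 27.5^2 = 1743.75
Sinv = np.linalg.inv(Sigma)
const = 1 / (2 * np.pi * np.sqrt(det))
print(det)                            # 1743.75 (up to rounding)
print(Sinv)
print(round(const, 4))                # 0.0038
```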
Notation of General Multivariate Normal Distribution
Let
\[ \mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix}, \ \ \ \mathbf{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}, \ \ \text{ and } \ \ \mathbf{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \sigma_{k3} & \cdots & \sigma_k^2 \end{pmatrix}. \]
We use the following notation to denote the k-dimensional normal distribution
\[ \mathbf{X} \sim \mathcal{N}_k(\mu, \mathbf{\Sigma}). \]
Properties of Multivariate Normal Distributions
Let \(\mathbf{X} = [X_1, X_2, \cdots, X_k]^T\) with \(\mu = E[\mathbf{X}]\) and \(\text{cov}(\mathbf{X}) = \mathbf{\Sigma} = (\sigma_{ij})\). The multivariate normal distribution is denoted by
\[ \mathbf{X} \sim \mathcal{N}_k(\mu, \mathbf{\Sigma}). \]
The following are properties of multivariate normal distributions that will be used in the subsequent note.
All marginal distributions (the distributions of the individual components) of a multivariate normal distribution are normal distributions.
All conditional distributions of a multivariate normal distribution are normal distributions.
All linear combinations of the components of a multivariate normal distribution are also normally distributed.
Recall that the MLE of \(\mathbf{\theta}\) based on IID \(\{x_1, x_2, \cdots, x_n \} \stackrel{\text{iid}}{\sim} f(x:\theta)\), denoted by \(\hat{\theta}\), is the solution to the following optimization problem
\[ \hat{\mathbf{\theta}} = \arg\max_{\theta \in \Theta} \log L(\theta: \mathbf{x}) \]
where \(L(\theta)\) is the likelihood of \(\theta\), given explicitly by
\[ L(\theta: \mathbf{x}) = \prod_{i=1}^n f(x_i:\theta). \]
The Fisher Information of \(\theta\) is given by
\[ \mathbb{I}_n(\theta) = E\left[ \left(\frac{\partial}{\partial \theta} \log L(\theta:\mathbf{x})\right) \left( \frac{\partial}{\partial \theta} \log L(\theta:\mathbf{x}) \right)^T\right] \]
\[ = - E\left[ \frac{\partial^2}{\partial \theta \partial \theta^T} \log L(\theta:\mathbf{x})\right]. \]
Denote the Fisher Information Matrix based on individual observed data points by
\[ \mathbb{I}_0(\theta) =- E\left[ \frac{\partial^2}{\partial \theta \partial \theta^T} \log L(\theta:x)\right] = \frac{\mathbb{I}_n(\theta)}{n}. \]
With the above discussions and notations, we have the following sampling distribution of MLE of \(\theta\).
To discuss the normality of MLE of a parametric distribution, we first derive the normality of the MLE of \(\mu\) and \(\sigma^2 \equiv \theta\) in the normal distribution.
A Non-numerical Example: Assume that \(\{x_1, x_2, \cdots, x_n\} \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)\) with density function
\[ f(x: \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]
The likelihood function is given by
\[ L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left[\frac{1}{2\pi}\right]^{n/2} (\sigma^2)^{-n/2} e^{-\frac{\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}}. \] The log-likelihood function is given by
\[ l(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 \]
To simplify the notation, we let \(\theta = \sigma^2\). Since the derivative of a constant is always zero, we can drop the constant term. The reduced log-likelihood function is given by
\[ l(\mu,\theta) = - \frac{n}{2} \log(\theta) - \frac{1}{2\theta}\sum_{i=1}^n (x_i-\mu)^2. \]
The score equations are given by
\[ \frac{\partial l(\theta, \mu)}{\partial \mu} = \frac{1}{\theta}\sum_{i=1}^n(x_i - \mu). \]
\[ \frac{\partial l(\theta, \mu)}{\partial \theta} = -\frac{n}{2}\frac{1}{\theta}+\frac{\sum_{i=1}^n (x_i-\mu)^2}{2\theta^2} = \frac{1}{2\theta}\left[ n- \frac{\sum_{i=1}^n (x_i-\mu)^2}{\theta}\right], \]
Solving the above score equation, we have
\[ \hat{\mu} = \frac{\sum_{i=1}^n x_i}{n} \ \ \text{ and } \ \ \hat{\theta} = \frac{\sum_{i=1}^n(x_i-\hat{\mu})^2}{n}. \]
To find the Hessian matrix, we first find the following second-order partial derivatives.
\[ \frac{\partial^2 l(\theta, \mu)}{\partial \mu^2} =-\frac{n}{\theta}. \]
\[ \frac{\partial^2 l(\theta, \mu)}{\partial \mu\partial \theta} = -\frac{\sum_{i=1}^n(x_i-\mu)}{\theta^2}, \]
\[ \frac{\partial^2 l(\theta, \mu)}{\partial \theta^2} = \frac{1}{\theta^2}\left[\frac{n}{2}-\frac{\sum_{i=1}^n(x_i-\mu)^2}{\theta} \right], \]
To calculate the Fisher Information matrix, we need the expectations of the second-order partial derivatives of the log-likelihood function. Note that
\[ E_X\left[ \sum_{i=1}^n (X_i - \mu)^2 \right] = n\theta \ \ \text{ and } \ \ E_X\left[ \sum_{i=1}^n (X_i - \mu) \right] = 0. \]
Then the Fisher Information Matrix is defined to be
\[ \mathbb{I}_n(\theta, \mu) = \begin{bmatrix} E_X \left(-\frac{\partial^2 l(\theta, \mu)}{\partial \mu^2} \right) & E_X \left(-\frac{\partial^2 l(\theta, \mu)}{\partial \mu \partial \theta} \right) \\ E_X \left(- \frac{\partial^2 l(\theta, \mu)}{\partial \theta\partial \mu}\right) & E_X \left(-\frac{\partial^2 l(\theta, \mu)}{\partial \theta^2} \right) \end{bmatrix} = \begin{bmatrix} \frac{n}{\theta} & 0 \\ 0 & \frac{n}{2\theta^2} \end{bmatrix}. \]
After simple algebra, we have the inverse of the Fisher Information matrix \(\mathbb{I}_n^{-1}(\mu, \theta)\) in the following form
\[ \mathbb{I}_n^{-1}(\mu, \theta) = \begin{bmatrix} \frac{\theta}{n} & 0 \\ 0 & \frac{2\theta^2}{n} \end{bmatrix}. \]
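As a sanity check on the matrix algebra (a Python/NumPy sketch, with arbitrary illustrative values \(n = 100\) and \(\theta = 1.5\)), inverting \(\mathbb{I}_n(\mu,\theta) = \mathrm{diag}\!\left(n/\theta,\ n/(2\theta^2)\right)\) numerically reproduces \(\mathrm{diag}\!\left(\theta/n,\ 2\theta^2/n\right)\):

```python
import numpy as np

# Invert the Fisher information of the normal model numerically and compare
# with the closed-form inverse diag(theta/n, 2*theta^2/n).
n, theta = 100, 1.5          # illustrative values, not from any data set
I_n = np.diag([n / theta, n / (2 * theta**2)])
I_n_inv = np.linalg.inv(I_n)
print(I_n_inv)
```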
Note that we can also calculate the variances of the MLEs of \(\mu\) and \(\theta\) directly in the following
\[ \text{var}(\hat{\mu}) = \text{var}\left( \frac{\sum_{i=1}^n X_i}{n}\right)= \frac{n\text{var}(X_1)}{n^2} = \frac{\sigma^2}{n} \equiv \frac{\theta}{n}. \]
\[ \text{var}(\hat{\theta}) = \text{var}\left( \frac{\sum_{i=1}^n (X_i-\mu)^2}{n}\right)= \frac{\text{var}\left(\sigma^2\sum_{i=1}^n \left[\frac{X_i-\mu}{\sigma}\right]^2\right)}{n^2} \]
\[ = \frac{\sigma^4\,\text{var}\left(\sum_{i=1}^n Z_i^2\right)}{n^2} = \frac{ \theta^2\,\text{var}\left(\chi_n^2\right)}{n^2} = \frac{\theta^2\times 2n}{n^2} = \frac{2\theta^2}{n} \equiv \frac{2\sigma^4}{n}. \]
Therefore,
\[ \text{var}\begin{bmatrix} \hat{\mu}\\ \hat{\theta} \end{bmatrix} = \begin{bmatrix} \frac{\theta}{n} & 0\\ 0 &\frac{2\theta^2}{n} \end{bmatrix} = \mathbb{I}_n^{-1}(\mu, \theta) \] We can show that
\[ \begin{bmatrix} \hat{\mu}\\ \hat{\theta} \end{bmatrix} \to_d \mathcal{N}\left( \begin{bmatrix} \mu \\ \theta \end{bmatrix}, \begin{bmatrix} \frac{\theta}{n} & 0\\ 0 &\frac{2\theta^2}{n} \end{bmatrix} \equiv \mathbb{I}_n^{-1}(\mu, \theta)\right). \] Since \(\mathbb{I}_n(\mu, \theta) = n\mathbb{I}_0(\mu, \theta)\), which means \(\mathbb{I}^{-1}_n(\mu, \theta) = \frac{1}{n}\mathbb{I}^{-1}_0(\mu, \theta)\), the above result can be re-written as
\[ \sqrt{n}\begin{bmatrix} \hat{\mu} - \mu\\ \hat{\theta} - \theta \end{bmatrix} \to_d \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \theta & 0\\ 0 & 2\theta^2 \end{bmatrix} \equiv \mathbb{I}_0^{-1}(\mu, \theta)\right). \]
The above expression is also called \(\sqrt{n}\)-convergence (in distribution). Note that the limiting distribution in the \(\sqrt{n}\)-convergent sequence does not depend on the sample size!
CAUTION: In the R functions optim() and nlm(), the reported Hessian matrices are based on the entire sample, that is, they correspond to \(\mathbb{I}_n\)! We need to use
\[ \begin{bmatrix} \hat{\mu} - \mu\\ \hat{\theta} - \theta \end{bmatrix} \stackrel{\text{approx}}{\sim} \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \mathbb{I}_n^{-1}(\mu, \theta)\right). \]
The Fisher Information matrix based on a single observation is given by
\[ \mathbb{I}_0(\theta, \mu) = \frac{\mathbb{I}_n(\theta, \mu)}{n} = \begin{bmatrix} \frac{1}{2\theta^2} & 0 \\ 0 & \frac{1}{\theta} \end{bmatrix}. \]
The inverse of \(\mathbb{I}_0(\theta, \mu)\) is
\[ \mathbb{I}_0^{-1}(\theta, \mu) = \begin{bmatrix} 2\theta^2 & 0 \\ 0 & \theta \end{bmatrix}. \]
\[ \sqrt{n}\begin{bmatrix} \hat{\theta} - \theta \\ \hat{\mu} - \mu \end{bmatrix} \to_d \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 2\theta^2 & 0 \\ 0 & \theta \end{bmatrix} \right). \]
This means the asymptotic variance of \(\sqrt{n}(\hat{\theta}-\theta) = \sqrt{n}(\widehat{\sigma^2}-\sigma^2)\) is \(2\theta^2 = 2\sigma^4\), the asymptotic variance of \(\sqrt{n}(\hat{\mu}-\mu)\) is \(\theta = \sigma^2\), and \(\text{cov}(\widehat{\sigma^2}, \hat{\mu}) = \text{cov}( \hat{\mu}, \widehat{\sigma^2}) = 0\).
Comments: When deriving the variances of the MLEs of \(\mu\) and \(\theta = \sigma^2\) above, we used the following basic facts: let \(\{Z_1, Z_2, \cdots, Z_k\}\) be IID standard normal random variables. We know that \(Z_i^2 \sim \chi_1^2\) for \(i = 1, 2, \cdots, k\). It can be shown that \[ \sum_{i=1}^k Z_i^2 = \underbrace{\chi_1^2 + \cdots + \chi_1^2}_\text{k terms} = \chi_k^2 \]
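The fact \(\text{var}(\chi_n^2) = 2n\) used in the derivation can be checked against SciPy's chi-square distribution (a Python sketch; the note's own code is in R):

```python
from scipy.stats import chi2

# The variance of a chi-square random variable with df degrees of freedom is 2*df.
for df in (1, 10, 100):
    print(df, chi2.var(df))
```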
Example: Consider 100 data values generated from the standard normal distribution \(N(\mu = 0, \theta = 1)\). We next use another R function, nlm(), a nonlinear minimization algorithm, to find the MLE of \(\mu\) and \(\theta = \sigma^2\). Since nlm() minimizes a target function, we need to take the negative of the log-likelihood in the minimization.
#x = rnorm(100, 0, 1)
x=c(1.792805419, 0.342860677, -0.255317312, -1.364983233, -0.056015367, -1.218466020,
1.794767102, 1.384529093, 1.219889332, 0.276775271, 1.384123185, 1.041670342,
0.413670003, 0.618146851, -1.637490879, 0.268310316, -0.947094897, 0.939817150,
0.984654871, -0.336352078, -0.561400723, 0.718517844, 1.082629274, 0.671479957,
0.527354934, 0.579240071, -0.554011351, -0.413788772, 0.206899723, -1.212571475,
-0.639332501, -1.442404296, -1.135050924, -0.815642573, -0.623510346, -0.914910322,
-0.124001632, -1.452480378, -1.199932611, -0.363336267, -1.377401513, -0.178908888,
0.817410999, 0.341526532, 0.473276632, 0.405464974, 0.633386932, 0.712173086,
0.861740783, 1.383002535, -1.146744310, -0.836828115, 0.633655387, -0.685368285,
-1.017187821, -2.031682582, 0.287353704, 0.524274343, -0.847569723, -0.519261653,
0.127991838, -0.082759387, -0.971874441, -1.440512775, 0.733544637, -0.137988641,
1.194088165, 0.431961137, -0.104938116, -0.383744374, 1.958910903, 1.172291214,
0.417051786, 0.177727313, -1.014603979, -0.600795761, 2.029299608, -0.005692817,
0.342174616, 0.538847659, 1.603139491, -1.833839110, -0.371162834, 0.369336123,
0.513809154, 1.401956500, -0.629441943, -0.720552255, 0.882058910, 1.272037143,
2.171273073, -0.219238137, 0.170988722, -1.929036793, 1.890392692, -1.053202796,
-0.222000459, -0.232437479, 1.882390377, -0.531731514)
n = length(x)
###
negLogLik0 = function(A){
a = A[1]
b = A[2]
(n/2)*log(2*pi) + (n/2)*log(b) + (1/(2*b))*sum((x-a)^2)
}
nlm(negLogLik0, p = c(0,1), hessian=TRUE)
$minimum
[1] 141.0356
$estimate
[1] 0.06208025 0.98298158
$gradient
[1] -3.240075e-06 -8.725465e-06
$hessian
[,1] [,2]
[1,] 101.73130534 -0.00511875
[2,] -0.00511875 51.72531132
$code
[1] 1
$iterations
[1] 6
Note that the above observed Fisher Information Matrix is close to the following theoretical Fisher Information Matrix.
\[ \mathbb{I}_{n}(\theta, \mu) = \begin{bmatrix} \frac{n}{\theta} & 0 \\ 0 & \frac{n}{2\theta^2} \end{bmatrix} = \begin{bmatrix} \frac{100}{1} & 0 \\ 0 & \frac{100}{2\times 1} \end{bmatrix} = \begin{bmatrix} 100 & 0 \\ 0 & 50 \end{bmatrix} . \]
For ease of presentation, we consider the likelihood function with two parameters \(\theta = (\alpha, \beta)\). To study the properties of the MLE, let \(\hat{\theta} = (\hat{\alpha}, \hat{\beta})\) be the MLE estimated from the random sample \(\{x_1, x_2, \cdots, x_n \} \sim f_{\theta}(x)\). Denote the data log-likelihood function
\[ l(\alpha, \beta) = \sum_{i=1}^n \log f_\theta(x_i). \]
We take a Taylor expansion of the system of score equations \(\partial l(\hat{\alpha}, \hat{\beta})/\partial \alpha = 0\) and \(\partial l(\hat{\alpha}, \hat{\beta})/\partial \beta = 0\) at \((\alpha, \beta)\) in the following
\[ 0 \approx \frac{\partial l(\hat{\alpha}, \hat{\beta})}{\partial \alpha}\approx \frac{\partial l({\alpha}, {\beta})}{\partial \alpha} + \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha^2} (\hat{\alpha}-\alpha) + \frac{\partial^2 l(\alpha, \beta)}{\partial\alpha \partial \beta}(\hat{\beta}-\beta). \]
\[ 0 \approx \frac{\partial l(\hat{\alpha}, \hat{\beta})}{\partial \beta}\approx \frac{\partial l({\alpha}, {\beta})}{\partial \beta} + \frac{\partial^2 l(\alpha, \beta)}{\partial \beta \partial\alpha} (\hat{\alpha}-\alpha) + \frac{\partial^2 l(\alpha, \beta)}{\partial \beta^2}(\hat{\beta}-\beta). \]
We rewrite the above two equations in the matrix form and obtain
\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} \frac{\partial l(\alpha, \beta)}{\partial \alpha} \\ \frac{\partial l(\alpha, \beta)}{\partial \beta} \end{pmatrix} + \begin{pmatrix} \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha^2} & \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l(\alpha, \beta)}{\partial \beta \partial \alpha } & \frac{\partial^2 l(\alpha, \beta)}{\partial \beta^2} \end{pmatrix} \begin{pmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{pmatrix}. \]
If the matrix of second-order partial derivatives is invertible, we re-express the above matrix equation into the following form
\[ \sqrt{n}\begin{pmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{pmatrix} = -\sqrt{n} \begin{pmatrix} \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha^2} & \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l(\alpha, \beta)}{\partial \beta \partial \alpha } & \frac{\partial^2 l(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}^{-1} \begin{pmatrix} \frac{\partial l(\alpha, \beta)}{\partial \alpha} \\ \frac{\partial l(\alpha, \beta)}{\partial \beta} \end{pmatrix}. \]
Note that
\[ -\begin{pmatrix} \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha^2} & \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l(\alpha, \beta)}{\partial \beta \partial \alpha } & \frac{\partial^2 l(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}^{-1} \to_p \mathbb{I}^{-1}_n(\alpha, \beta) \ \ \text{ and } \ \ \begin{pmatrix} \frac{\partial l(\alpha, \beta)}{\partial \alpha} \\ \frac{\partial l(\alpha, \beta)}{\partial \beta} \end{pmatrix} \to_d \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \mathbb{I}_n(\alpha, \beta) \right) \]
Using the formula \(\text{var}(A\mathbf{X}) = A\, \text{var}(\mathbf{X})\, A^T\) for any constant matrix \(A\), we have
\[ \text{var}\left\{ -\sqrt{n} \begin{pmatrix} \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha^2} & \frac{\partial^2 l(\alpha, \beta)}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l(\alpha, \beta)}{\partial \beta \partial \alpha } & \frac{\partial^2 l(\alpha, \beta)}{\partial \beta^2} \end{pmatrix}^{-1} \begin{pmatrix} \frac{\partial l(\alpha, \beta)}{\partial \alpha} \\ \frac{\partial l(\alpha, \beta)}{\partial \beta} \end{pmatrix} \right\} = n\, \mathbb{I}_n^{-1}(\alpha, \beta)\, \text{var}\left[ \begin{pmatrix} \frac{\partial l(\alpha, \beta)}{\partial \alpha} \\ \frac{\partial l(\alpha, \beta)}{\partial \beta} \end{pmatrix} \right] [\mathbb{I}_n^{-1}(\alpha, \beta)]^T \]
\[ = n\, \mathbb{I}_n^{-1}(\alpha, \beta)\, \mathbb{I}_n(\alpha, \beta)\, [\mathbb{I}_n^{-1}(\alpha, \beta)]^T = n\mathbb{I}_n^{-1}(\alpha,\beta) = \left(\frac{\mathbb{I}_n(\alpha, \beta)}{n}\right)^{-1} = \mathbb{I}_0^{-1}(\alpha, \beta). \]
We used the fact that the Fisher Information Matrix is symmetric in the above derivation. To summarize, we have the following theorem
Theorem: (Asymptotic normality of MLE) Under some regularity conditions, the MLE of \(\theta\) based on \(\{x_1, x_2, \cdots, x_n \}\stackrel{\text{iid}}{\sim} f(x:\theta)\) has the following asymptotic normality
\[ \sqrt{n}(\hat{\theta} - \theta) \rightarrow_d \mathcal{N}\left[\mathbf{0}, \mathbb{I}_0^{-1}(\theta)\right]. \]
\(\rightarrow_d\) means convergence in distribution. Since \(\theta\) in the covariance matrix \(\mathbb{I}_0^{-1}(\theta)\) is unknown, when making inferences we use \(\widehat{\mathbb{I}_0(\theta)} = \mathbb{I}_0(\hat{\theta})\).
Remarks:
\(\sqrt{n}(\hat{\theta} - \theta)\) can be considered as a sequence of random variables (indexed by the sample size \(n\)). \(\rightarrow_d\) stands for convergence in distribution.
\((\hat{\theta} - \theta)\) is approximately normally distributed when the sample size is large, with variance \(\mathbb{I}_n^{-1}(\theta)\). That is,
\[ (\hat{\theta} - \theta) \stackrel{\text{approx}}{\sim} \mathcal{N}\left[ \mathbf{0}, \mathbb{I}^{-1}_n(\theta)\right]. \]
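In practice, this approximation is what justifies Wald-type confidence intervals. A minimal sketch (in Python, using the estimates \(\hat{\mu} = 0.06208\), \(\hat{\theta} = 0.98298\), and \(n = 100\) from the earlier nlm() run on the normal data): since \(\text{var}(\hat{\mu}) \approx \hat{\theta}/n\), an approximate 95% interval for \(\mu\) is \(\hat{\mu} \pm 1.96\sqrt{\hat{\theta}/n}\).

```python
import math

# Approximate 95% Wald interval for mu, using var(mu_hat) ~ theta_hat / n
# with the MLEs from the earlier nlm() output.
n = 100
mu_hat, theta_hat = 0.06208025, 0.98298158
se_mu = math.sqrt(theta_hat / n)
ci = (mu_hat - 1.96 * se_mu, mu_hat + 1.96 * se_mu)
print(ci)
```

The interval covers 0, consistent with the data having been generated from \(N(0, 1)\).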
Example: The following data represent active repair times (in hours) for an airborne communication transceiver.
0.2, 0.3, 0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0, 1.1, 1.3, 1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.2, 2.5, 3.0, 3.0, 3.3, 3.3, 4.0, 4.0, 4.5, 4.7, 5.0, 5.4, 5.4, 7.0, 7.5, 8.8, 9.0, 10.3, 22.0, 24.5.
Assume that the above data set was taken from a Weibull distribution with density function \(f(x, \alpha, \beta) = \alpha\beta x^{\beta-1} e^{-\alpha x^\beta}\).
As we derived in the previous note, the log-likelihood of observing the data is a function of \(\alpha\) and \(\beta\)
\[ l(\alpha, \beta) = n[\log(\alpha) + \log (\beta)] + (\beta -1)\sum_{i=1}^n\log (x_i)-\alpha\sum_{i=1}^n x_i^\beta \]
library(plotly)
# Data set
dat = c(0.2, 0.3, 0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 1.0, 1.0,
1.0, 1.0, 1.1, 1.3, 1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.2, 2.5, 3.0, 3.0, 3.3,
3.3, 4.0, 4.0, 4.5, 4.7, 5.0, 5.4, 5.4, 7.0, 7.5, 8.8, 9.0, 10.3, 22.0, 24.5)
n = length(dat)
x1 = seq(0.1, 0.5, length = 50)
y1 = seq(0.1, 1, length = 50)
## likelihood surface (exponentiated log-likelihood)
zfun <- function(x1,y1){
exp(n*log(x1) + n*log(y1) + (y1-1)*sum(log(dat)) - x1*sum(dat^y1))
}
z = outer(x1, y1, zfun)
plot_ly(x = x1, y = y1, z = z) %>% add_surface()
The score equations are given by
\[ \begin{cases} \frac{\partial l(\alpha, \beta)}{\partial \alpha} = \frac{n}{\alpha} - \sum_{i=1}^n x_i^\beta= 0 , \\ \frac{\partial l(\alpha, \beta)}{\partial \beta}= \frac{n}{\beta} + \sum_{i=1}^n \log(x_i) - \alpha \sum_{i=1}^n x_i^\beta \log(x_i)= 0. \end{cases} \]
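The score equations have no closed-form solution, so the MLE must be found numerically. Before turning to R's optim(), here is an equivalent sketch in Python with SciPy (assuming scipy is available); the parameters are optimized on the log scale so the search cannot wander into \(\alpha \le 0\) or \(\beta \le 0\):

```python
import numpy as np
from scipy.optimize import minimize

# Repair-time data from the example above.
x = np.array([0.2, 0.3, 0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8,
              1.0, 1.0, 1.0, 1.0, 1.1, 1.3, 1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.2,
              2.5, 3.0, 3.0, 3.3, 3.3, 4.0, 4.0, 4.5, 4.7, 5.0, 5.4, 5.4, 7.0,
              7.5, 8.8, 9.0, 10.3, 22.0, 24.5])
n = len(x)

# Negative Weibull log-likelihood, parameterized as (log alpha, log beta).
def neg_log_lik(p):
    a, b = np.exp(p)
    return -(n * np.log(a) + n * np.log(b)
             + (b - 1) * np.sum(np.log(x)) - a * np.sum(x ** b))

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
alpha_hat, beta_hat = np.exp(res.x)
print(alpha_hat, beta_hat)   # close to 0.3377 and 0.8896
```

The result agrees with the MLE \((\hat{\alpha}, \hat{\beta}) \approx (0.3377, 0.8896)\) reported by optim() below.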
We next write R code to find the MLE and the Hessian matrix (the negative observed Fisher Information Matrix).
CAUTION: The R optimization function optim() reports the Hessian matrix based on the entire sample of observations.
# Data set
x = c(0.2, 0.3, 0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0,
1.1, 1.3, 1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.2, 2.5, 3.0, 3.0, 3.3, 3.3, 4.0, 4.0, 4.5,
4.7, 5.0, 5.4, 5.4, 7.0, 7.5, 8.8, 9.0, 10.3, 22.0, 24.5)
n = length(x)
## log-likelihood
negLogLik = function(A){
a = A[1]
b = A[2]
n*log(a) + n*log(b) + (b-1)*sum(log(x)) - a*sum(x^b)
}
## gradient function
grFun = function(A){
a = A[1]
b = A[2]
ga = n/a - sum(x^b)
gb= n/b +sum(log(x)) -a*sum(x^b*log(x))
c(ga,gb)
}
## calling R function optim()
results = optim(par = c(.5,.5), fn = negLogLik, gr = grFun, method = "BFGS",
control = list(maxit = 20000,fnscale = -1), hessian = TRUE)
MLE = results$par
Counts = results$counts
Convergence = results$convergence
Hessian = results$hessian
I0.inv = -solve(Hessian)
out = list(MLE = MLE, Counts = Counts, Convergence = Convergence, Hessian = Hessian, I0.inv = I0.inv)
out
$MLE
[1] 0.3377259 0.8896178
$Counts
function gradient
33 9
$Convergence
[1] 0
$Hessian
[,1] [,2]
[1,] -394.5369 -236.5124
[2,] -236.5124 -250.2269
$I0.inv
[,1] [,2]
[1,] 0.005848388 -0.005527848
[2,] -0.005527848 0.009221250
We evaluate the gradient vector at the MLE in the following. Theoretically, the gradient vector should be close to 0. However, how close the gradient vector is to zero depends on the shape of the likelihood surface near the MLE.
grFun(c(0.3377259, 0.8896178))
[1] -6.173184e-06 -5.231722e-06
Therefore, the approximate sampling distribution of \((\hat{\alpha}, \hat{\beta})\) is given by
\[ \begin{pmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{pmatrix} \stackrel{\text{approx}}{\sim} \mathcal{N} \left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.005848388 & -0.005527848 \\ -0.005527848 & 0.009221250 \end{pmatrix} \right] \]
Remarks
Caution:
The Hessian matrix \(\mathbb{H}\) reported in
optim() is negative observed Fisher information: \(-\widehat{\mathbb{I}_n(\theta)} =
-\mathbb{I}_n(\hat{\theta})\)!
The covariance matrix in the asymptotic normality of the MLE requires \(\mathbb{I}_0(\theta)\). The observed Fisher information \(\mathbb{I}_0(\hat{\theta})\) is equal to the negative Hessian matrix divided by the sample size \(n\). That is, \(\mathbb{I}_0(\hat{\theta}) = -\mathbb{H}(\hat{\theta}:\mathbf{x})/n.\)
\(\text{var}[\hat{\alpha}-\alpha] = \text{var}(\hat{\alpha}) = 0.005848388\); \(\text{var}(\hat{\beta}) = 0.4149552/45 \approx 0.009221227\); and \(\text{cov}(\hat{\alpha}, \hat{\beta}) = -0.2487531/45 = -0.005527847\).
Let \(\hat{\omega} = c\hat{\alpha} + d\hat{\beta}\). Then \(\text{var}(\hat{\omega}) = \text{var}[ c\hat{\alpha} + d\hat{\beta}]\) \(= c^2\text{var}(\hat{\alpha}) + d^2\text{var}(\hat{\beta}) + 2cd\,\text{cov}(\hat{\alpha}, \hat{\beta})\) \(= 0.005848388c^2 + 0.009221250d^2 - 2\times 0.005527848cd\).
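The variance of the linear combination \(\hat{\omega}\) can be computed either by the expanded formula above or as the quadratic form \(\mathbf{d}^T \widehat{\text{cov}}\, \mathbf{d}\). A Python sketch, with illustrative coefficients \(c = 1\), \(d = 2\) and the covariance matrix from the optim() output:

```python
import numpy as np

# var(omega_hat) for omega_hat = c*alpha_hat + d*beta_hat, two ways:
# quadratic form d^T Cov d versus the expanded scalar formula.
Cov = np.array([[0.005848388, -0.005527848],
                [-0.005527848, 0.009221250]])
c, d = 1.0, 2.0                      # illustrative coefficients
w = np.array([c, d])
var_omega = w @ Cov @ w
expanded = c**2 * Cov[0, 0] + d**2 * Cov[1, 1] + 2 * c * d * Cov[0, 1]
print(var_omega, expanded)
```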
In the following, we use nlm() to find the MLE of \(\alpha\) and \(\beta\). Note that nlm() reports the Hessian matrix at the minimum of the objective function; hence it is positive semi-definite. This differs from the Hessian reported by optim() above, which is negative semi-definite.
# Data set
x = c(0.2, 0.3, 0.5, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 1.0, 1.0, 1.0, 1.0,
1.1, 1.3, 1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.2, 2.5, 3.0, 3.0, 3.3, 3.3, 4.0, 4.0, 4.5,
4.7, 5.0, 5.4, 5.4, 7.0, 7.5, 8.8, 9.0, 10.3, 22.0, 24.5)
n = length(x)
##
negLogLik0 = function(A){
a = A[1]
b = A[2]
-(n*log(a) + n*log(b) + (b-1)*sum(log(x)) - a*sum(x^b)) # negate log-likelihood!
}
nlm(negLogLik0, p = c(1,1), hessian=TRUE)
$minimum
[1] 102.3452
$estimate
[1] 0.3377254 0.8896176
$gradient
[1] -5.566392e-05 -4.544631e-05
$hessian
[,1] [,2]
[1,] 394.3012 236.5408
[2,] 236.5408 250.2637
$code
[1] 1
$iterations
[1] 12